Loading required package: viridisLite
To calculate the allele frequencies, I began with the bcf output from mega-post-bcf-exploratory-snakeflow so that the data has gone through all our normal filtering steps. The basic steps are to get the genotype likelihoods, convert the bcf to a vcf and subset by population, and then calculate the allele frequencies using angsd.
mamba activate bcftools1.15 #load in bcftools environment
bcftools +tag2tag main.bcf -- -r -PL-to-GL > genolikes.bcf # converts the FORMAT/PL source tag to FORMAT/GL, since angsd needs the tag
bcftools view genolikes.bcf -S sample_subset.txt -O v -o subset_genolikes.vcf # given the list of samples, subsets and saves as a vcf
angsd -vcf-gl subset_genolikes.vcf -fai genome.fasta.fai -nind 10 -domaf 3 -out angsd_outputs/subset # angsd calculates the minor allele frequencies based on an assumed known major and minor allele but takes the uncertainty of the minor allele into account (-domaf 3) and the number of individuals in changed based on how many you have (this can be grabbed with a quick wc -l)
Now we use the .mafs.gz outputs of angsd for our plots. What makes these version 2 is that the first time I calculated allele frequencies with the corrected metadata (and thus the larger Mid-Atlantic group), I did it from the mega-non-model-wgs-snakeflow bcf (which is filtered less).
In Eric’s example, he filtered by 30 for 64 individuals, so you can play with how heavily you want to filter the data. For this, I prefer 70-85% and typically choose 80% and always round down. Having a higher number of individuals have information at that region makes me more sure of the results, but it might be reasonable to start at 50% or so just to check it out before stricter filtering.
Now, we do the comparisons:
natla_mida <- inner_join(natla_freqs,
mida_freqs,
by = c("chromo", "position"),
suffix = c("_n", "_m")) %>% # joins together the northern and mid-atlantic allele frequencies by the chromosome and position, keeping the observations that match and keeps the calculated values separate
mutate(ave_freq = (unknownEM_n + unknownEM_m) / 2, # the average frequency from -domaf 3
abs_diff = abs(unknownEM_n - unknownEM_m)) # the absolute difference in those frequencies
natla_grtl <- inner_join(natla_freqs,
grtl_freqs,
by = c("chromo", "position"),
suffix = c("_n", "_g")) %>%
mutate(ave_freq = (unknownEM_n + unknownEM_g) / 2,
abs_diff = abs(unknownEM_n - unknownEM_g))
mida_grtl <- inner_join(mida_freqs,
grtl_freqs,
by = c("chromo", "position"),
suffix = c("_m", "_g")) %>%
mutate(ave_freq = (unknownEM_m + unknownEM_g) / 2,
abs_diff = abs(unknownEM_m - unknownEM_g))
natla_finl <- inner_join(natla_freqs,
finl_freqs,
by = c("chromo", "position"),
suffix = c("_n", "_f")) %>%
mutate(ave_freq = (unknownEM_n + unknownEM_f) / 2,
abs_diff = abs(unknownEM_n - unknownEM_f))
mida_finl <- inner_join(mida_freqs,
finl_freqs,
by = c("chromo", "position"),
suffix = c("_m", "_f")) %>%
mutate(ave_freq = (unknownEM_m + unknownEM_f) / 2,
abs_diff = abs(unknownEM_m - unknownEM_f))
Now I’m going to check the distribution of our data to make plotting it easier, and pick a cutoff that will retain most of the information.
nxm_check <- ggplot(data = natla_mida,
mapping = aes(x = ave_freq,
y = abs_diff)) +
geom_hex(binwidth = 0.001) +
scale_fill_viridis_c()
nxm_check
nxg_check <- ggplot(data = natla_grtl,
mapping = aes(ave_freq,
y = abs_diff)) +
geom_hex(binwidth = 0.001) +
scale_fill_viridis_c()
nxg_check
mxg_check <- ggplot(mida_grtl,
mapping = aes(x = ave_freq,
y = abs_diff)) +
geom_hex(binwidth = 0.001) +
scale_fill_viridis_c()
mxg_check
nxf_check <- ggplot(natla_finl,
mapping = aes(x = ave_freq,
y = abs_diff)) +
geom_hex(binwidth = 0.001) +
scale_fill_viridis_c()
nxf_check
mxf_check <- ggplot(mida_finl,
mapping = aes(x = ave_freq,
y = abs_diff)) +
geom_hex(binwidth = 0.001) +
scale_fill_viridis_c()
mxf_check
I want to try to keep this in the same range as a 50,000 size sliding window, so I’m going to filter my data for an absolute difference greater than 0.15
Then I set up the data for plotting by getting the center position of each chromosome, so that the labels are centered on each chromosome and not repeated.
Finally, I plot the absolute differences of the allele frequencies across the entire genome, focusing on the sections that the pairwise fst analysis showed peaks in fst value.
Starting with the Northern Atlantic versus the Mid-Atlantic Populations
Northern Atlantic versus the Great Lakes Populations
Mid-Atlantic versus the Great Lakes Populations
Northern Atlantic versus the Finger Lakes Populations
Mid-Atlantic versus the Finger Lakes Populations
There isn’t anything popping up other than the big spike on chromosome 2, even though we see some spikes above 0.25 in Fst on some of the other chromosomes. Let’s just compare the spike regions for chromosome 2.
Success! We’re seeing the same spike in absolute difference of allele frequency in the Mid-Atlantic populations when compared the either Great Lakes or Finger Lakes that we see in the Northern Atlantic vs Mid-Atlantic comparison. Because the allele frequencies aren’t very different between the Northern Atlantic and Great/Finger Lakes at that highly variable site, it looks like the alewife from Northern Atlantic populations (Miramichi and Saco River) may have been the source population for the Great Lakes and the Finger Lakes.
Testing out the differences between Great Lakes and Finger Lakes, which group pretty strongly together in PCA.